The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are discarded. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Hover over an entry to display the information used to compute its p-value.
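Under this null hypothesis, the p-value reduces to an exact two-sided sign test on the non-tie comparisons. A minimal sketch (the function name is ours; it assumes ties have already been discarded and only the two win counts remain):

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test: under the null hypothesis, each
    non-tie comparison is a fair coin flip between A and B."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 1/2), doubled
    # for the two-sided test (capped at 1).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)
```

For example, 8 wins against 2 gives a p-value of about 0.11 — too few comparisons to reject the null at the usual 0.05 level despite the lopsided score.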
We can also examine the typical p-value for a given difference in accuracy. Hover over a point to display the actual model pair it represents.
Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena). These usually have near-perfect correlation.
| # | model | pass1 | win_rate | elo |
|---|---|---|---|---|
| 0 | gpt-4-0613+cot | 0.755 | 0.952 | 1540.237 |
| 1 | gpt-4-turbo-2024-04-09+cot | 0.757 | 0.878 | 1380.317 |
| 2 | gpt-3.5-turbo-0613+cot | 0.503 | 0.790 | 1259.673 |
| 3 | gpt-4-0613 | 0.698 | 0.743 | 1203.368 |
| 4 | claude-3-opus-20240229+cot | 0.734 | 0.714 | 1174.246 |
| 5 | gpt-4-turbo-2024-04-09 | 0.685 | 0.712 | 1174.064 |
| 6 | codellama-34b+cot | 0.501 | 0.709 | 1173.731 |
| 7 | codellama-13b+cot | 0.474 | 0.646 | 1115.937 |
| 8 | claude-3-opus-20240229 | 0.642 | 0.593 | 1068.591 |
| 9 | codellama-7b+cot | 0.404 | 0.587 | 1064.422 |
| 10 | codetulu-2-34b | 0.492 | 0.570 | 1049.250 |
| 11 | codellama-34b | 0.472 | 0.555 | 1036.052 |
| 12 | deepseek-base-33b | 0.465 | 0.537 | 1022.381 |
| 13 | deepseek-instruct-33b | 0.465 | 0.512 | 1002.395 |
| 14 | gpt-3.5-turbo-0613 | 0.490 | 0.508 | 1000.000 |
| 15 | codellama-python-34b | 0.439 | 0.507 | 998.455 |
| 16 | phind | 0.472 | 0.500 | 993.536 |
| 17 | codellama-13b | 0.425 | 0.496 | 989.942 |
| 18 | deepseek-base-6.7b | 0.419 | 0.492 | 985.885 |
| 19 | mixtral-8x7b | 0.393 | 0.466 | 965.541 |
| 20 | codellama-python-13b | 0.397 | 0.466 | 965.146 |
| 21 | magicoder-ds-7b | 0.417 | 0.433 | 939.551 |
| 22 | wizard-34b | 0.427 | 0.429 | 937.456 |
| 23 | codellama-python-7b | 0.373 | 0.399 | 912.804 |
| 24 | codellama-7b | 0.360 | 0.386 | 901.755 |
| 25 | mistral-7b | 0.350 | 0.376 | 894.167 |
| 26 | deepseek-instruct-6.7b | 0.374 | 0.355 | 877.088 |
| 27 | phi-2 | 0.316 | 0.352 | 873.544 |
| 28 | wizard-13b | 0.365 | 0.352 | 874.113 |
| 29 | starcoderbase-16b | 0.313 | 0.339 | 863.138 |
| 30 | starcoderbase-7b | 0.297 | 0.291 | 821.371 |
| 31 | phi-1.5 | 0.232 | 0.274 | 806.429 |
| 32 | deepseek-base-1.3b | 0.278 | 0.251 | 783.953 |
| 33 | deepseek-instruct-1.3b | 0.272 | 0.242 | 774.527 |
| 34 | phi-1 | 0.131 | 0.086 | 556.104 |
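The win-rate and Elo columns above can be derived from a matrix of pairwise wins. A minimal sketch, not the evaluation's actual code: it assumes ties are split as half a win to each side, fits Bradley-Terry strengths with the standard minorization-maximization iteration, and maps them to an Elo-like scale by anchoring one model at 1000 (the anchor choice and the 400·log10 scale are assumptions).

```python
import math

def average_win_rate(wins, n_models):
    """Average win-rate of each model over all other models.
    wins[i][j] = times model i beat model j (ties pre-split as 0.5)."""
    rates = []
    for i in range(n_models):
        per_opponent = []
        for j in range(n_models):
            if j == i:
                continue
            games = wins[i][j] + wins[j][i]
            if games:
                per_opponent.append(wins[i][j] / games)
        rates.append(sum(per_opponent) / len(per_opponent))
    return rates

def fit_bradley_terry(wins, n_models, iters=1000):
    """Fit Bradley-Terry strengths p_i by MM iteration:
    p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    p = [1.0] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_models) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        s = sum(new_p)
        p = [x * n_models / s for x in new_p]  # normalize for stability
    return p

def to_elo_scale(p, anchor=0, anchor_elo=1000.0):
    """Map strengths to an Elo-like scale (400 * log10 of the odds
    ratio), pinning the anchor model at anchor_elo."""
    return [anchor_elo + 400 * math.log10(x / p[anchor]) for x in p]
```

With two models where A beats B 3 times and loses once, the fitted strength ratio converges to 3, an Elo gap of 400·log10(3) ≈ 191 points.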